Digital Computer Laboratory
Massachusetts Institute of Technology 
Cambridge, Massachusetts 


SUBJECT: DIAGNOSTIC PROGRAMS AND MARGINAL CHECKING IN THE WHIRLWIND I COMPUTER
(Text of paper presented at the New York Convention of the Institute
of Radio Engineers on March 24, 1953)

To: Stephan H. Dodd

From: N. L. Daggett, E. S. Rich

Date: March 26, 1953

Abstract: In the Whirlwind I computer, constructed at MIT under Office of
Naval Research sponsorship and presently operated under Joint
Services support, it has been found that marginal checking
vastly reduces the machine failure rate. A series of test
programs, each of which thoroughly exercises a different section
of the machine, is used in the marginal checking procedure.

Marginal checking cannot prevent intermittent and total failures
caused by shorts and opens. These are isolated by methods combining
built-in checking features, diagnostic programming, signal tracing,
and operator experience and ingenuity. These methods are greatly
facilitated by a special program control which allows a periodically
repeated test program to be stopped at an arbitrary point to study
indicator lights and signal waveforms.


1.0 INTRODUCTION


Through four years of experience in maintaining the Whirlwind I
computer, several improvements in trouble location techniques over those
originally conceived have been worked out. This experience provided
knowledge of what types of failures must be dealt with, what procedures
are most effective, and what special features are helpful to an operator
in localizing trouble. The Whirlwind computer was constructed at MIT
under sponsorship of the Office of Naval Research and is presently
operated under support of the Joint Services. I will first discuss briefly
the types of faults which are encountered, then will outline basic
philosophies of failure diagnosis which are peculiar to the machine. Next
I will describe facilities provided to aid an operator in his diagnoses,
and finally will illustrate the actual procedures which are in use.


2.0 FAULTS TO BE DIAGNOSED 

Faults in the computer system are classified into four categories.
Three of these are well known and typical of any electronic
equipment. They are (1) gradual deterioration, (2) sudden failures
such as shorts or opens, and (3) intermittent or transient failures.

The fourth category is peculiar to an experimental machine in which
modification and expansion are being carried out. Since the central
portion of the computer became operative, there has been a continuing
program to expand the internal storage capacity and the terminal
equipment facilities. Because of this work, it is necessary to contend with
faults that are the result of maladjustment and weaknesses in
newly-installed equipment. These then form the fourth category.

With the procedures which have been worked out in Whirlwind I, 
it has been found that the faults which can be located most easily are 
sudden complete failures. Gradual deterioration and defects associated 
with newly-installed equipment also are relatively easy to find.
Intermittent failures, however, are difficult to deal with and therefore are
considered the most serious.


3.0 PHILOSOPHY OF FAILURE DIAGNOSIS 


It is to be expected that the trouble location methods used
in a computer reflect its logical design. In Whirlwind, these trouble
location methods also reflect the mechanical arrangement of the system.
When the Whirlwind computer was being planned, it was felt that panels
should be constructed so that all component connections would be readily
accessible while the system was in operation. This would facilitate
signal tracing with video probes while the system was first being checked
out, and would side-step many packaging problems. With this extremely
open type of construction, it has been found more practical to repair
circuits in place rather than to substitute spare panels. Obviously
this makes trouble location procedures more complex. Faults must be
completely isolated rather than merely localized to a given panel or
chassis. A strong argument in favor of such an arrangement is that
the computer can be used as a powerful testing device. Bench testing,
with necessarily limited facilities for signal generation and detection,
sometimes may not show up all the malfunctions in a circuit.

Another mechanical design feature is reflected in the trouble
location methods now employed. It is the layout of the computer's
control center, which consists of a flexible arrangement of panels in
standard racks rather than a relatively fixed operating console. This
has encouraged the installation of special machine controls and special
facilities for monitoring critical signals for testing purposes. Of
particular value is some equipment which can be used to change the
over-all logic of the machine control. I will describe this later in
my talk.


As a final point on the philosophy of failure diagnosis in
Whirlwind, considerable emphasis is placed on marginal checking. By
discovering deteriorating circuits before they cause trouble, the number
of interrupting failures can be kept low. The possibility of a
deteriorating component causing intermittent failures, the type most difficult
to isolate, is virtually eliminated.


4.0 EQUIPMENT AIDS IN TROUBLE LOCATION


I have just described the types of faults to be diagnosed and
some special characteristics of the computer which have influenced the
choice of trouble location methods used. Now a brief discussion of the
equipment provided to aid in trouble diagnosis will complete the
background needed for an explanation of the actual checking procedures used.

4.1 Built-In Alarms


An important aid to the operator, in fact the one around which
nearly all of the trouble location procedures are centered, is a system
of built-in alarms. There are a total of eight different alarm
indications. Any one of these will stop the computer operation when the
alarm occurs. These eight alarms are evenly distributed among the four
main subdivisions of the computer: the central control, the arithmetic
element, the internal memory, and the input-output element. Generally
speaking, they are designed to monitor the operation of critical control
circuits or to show up certain cases of nonpermissible programming. One
of the alarms, a transfer check, is applied more frequently than the
others and covers all sections of the computer. It checks that words
transferred between registers by means of the common bus system are
correctly received. The check is accomplished by a special register
which receives the word by two different paths, one directly from the
main bus and the second from the receiving register via a second check
bus.

The special identity checking facilities of the transfer check
have been used for implementing an identity check order as a part of the
standard order code. This order makes it possible for a programmer to
arbitrarily command a check on the contents of the accumulator against
a word stored in the memory. Such an order obviously is valuable in
trouble location and diagnostic programming work.
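
For a present-day reader, the transfer check and the identity check order
can be sketched in a few lines of Python. This is only an illustrative
model under stated assumptions, not a description of the Whirlwind
circuits: the names Register, dead_bits, checked_transfer, identity_check
and CheckAlarm are hypothetical, and a fault is modeled simply as bits
that fail to set in the receiving register.

```python
WORD_MASK = 0xFFFF  # Whirlwind I used 16-bit words


class CheckAlarm(Exception):
    """Stands in for an alarm that stops the computer."""


class Register:
    """Hypothetical register; `dead_bits` models bits that fail to set."""

    def __init__(self, dead_bits: int = 0) -> None:
        self.value = 0
        self.dead_bits = dead_bits

    def load(self, word: int) -> None:
        self.value = word & WORD_MASK & ~self.dead_bits


def checked_transfer(word: int, receiver: Register) -> None:
    """Transfer a word and compare the two copies seen by the check register."""
    receiver.load(word)
    from_main_bus = word & WORD_MASK   # first path: directly from the main bus
    from_check_bus = receiver.value    # second path: back from the receiver
    if from_main_bus != from_check_bus:
        raise CheckAlarm(
            f"transfer check: sent {from_main_bus:#06x}, "
            f"received {from_check_bus:#06x}"
        )


def identity_check(accumulator: int, stored_word: int) -> None:
    """Programmed check of the accumulator against a word held in memory."""
    if (accumulator & WORD_MASK) != (stored_word & WORD_MASK):
        raise CheckAlarm("identity check: accumulator differs from stored word")
```

A diagnostic program in this style would follow each computed result with
an identity_check against a precomputed constant, so that an error raises
an alarm within a few orders of its occurrence.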

4.2 Marginal Checking Equipment

For locating gradual deterioration, the marginal checking
system is the principal tool. Marginal checking consists of variation
of certain d-c supply voltages to the tubes rather than variation of
heater voltages. The circuits for marginal checking are an integral
part of the power distribution system and are so designed that voltage
variation can take place in only a small section of the computer at a
time. The whole computer is divided into about two hundred such sections.
These may be chosen manually or in an automatic sequence during marginal
checking procedures. Insofar as possible the sectionalization was done
so that logically dependent parts of the computer are on different voltage
variation circuits. This combines a powerful trouble location feature
with the ability to determine whether the system performance is
deteriorating.


4.3 Cyclic Program Control

Sudden failures and certain types of intermittent failures
require a diagnostic approach different from that for deteriorating
components. To assist in a detailed analysis of such troubles, a special
computer control feature, called a cyclic program control, has been
provided. Basically the cyclic program control permits a change in machine
logic. It makes it possible to interpret flip-flop indicators and signal
waveforms while preserving normal high-speed operation of the particular
program giving the trouble. This control embodies mechanisms to stop
the computer at any step in the program and then to restart it at the
beginning of the program. Since the number of orders executed may be
adjusted by simply varying a delay, the flow of information from one
register to another can be observed visually on an oscilloscope.
Furthermore, the restart is somewhat delayed following a stop so this same
flow of information between registers can be observed on flip-flop
indicator lights grouped at the central control location. In general,
the cyclic program control permits an operator to set up complicated
conditions within the computer identical or equivalent to those of
normal operation and at the same time obtain an outward simplicity
that makes analysis relatively easy.
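
The stop-and-restart behavior of the cyclic program control can be
imitated in software. The following Python sketch is a loose analogy
under assumptions of my own: a program is a list of functions acting on
a dictionary of registers, the register labels "AC" and "BR" and the
three sample orders are purely illustrative, and the adjustable delay is
replaced by an explicit count of orders, stop_after.

```python
from typing import Callable, Dict, List

Registers = Dict[str, int]
Order = Callable[[Registers], None]


def run_cycle(program: List[Order], stop_after: int) -> Registers:
    """Restart the program from its beginning and stop after `stop_after` orders."""
    registers: Registers = {"AC": 0, "BR": 0}  # fresh state on every restart
    for order in program[:stop_after]:
        order(registers)
    return dict(registers)  # what the indicator lights would show at the stop


def load_five(r: Registers) -> None:
    r["AC"] = 5


def copy_ac_to_br(r: Registers) -> None:
    r["BR"] = r["AC"]


def double_ac(r: Registers) -> None:
    r["AC"] *= 2


program = [load_five, copy_ac_to_br, double_ac]

# Vary the stop point, as the operator would vary the delay.
snapshots = [run_cycle(program, n) for n in range(len(program) + 1)]
```

Each entry of snapshots corresponds to one setting of the stop point, so
stepping through the list shows the flow of information from one
register to another, one order at a time.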

4.4 Records of Intermittent Failures


For intermittent type failures, little specialized equipment
is available to assist in trouble location. Two features are worthy
of mention. First, a camera has been set up so that the control panels
can be photographed to show all flip-flop light indications and all
control switch settings at the time of an error. This makes it practical
to preserve data on all such errors without seriously delaying
applications work. The photograph is supplemented by a report giving other
details concerning the program and method of using the computer that
might be helpful in later study of the failure.

Since many intermittent failures are the result of poor
connections on panels or momentary shorts within tubes, they can be
precipitated by shock or vibration. A second feature which helps in
localizing intermittent trouble is an arrangement for producing throughout
the computer room an audible signal characteristic of the program being
run. As tubes or panels are being tapped, an intermittent fault is
indicated by an interruption of this signal, after which the program
automatically restarts.


5.0 TROUBLE LOCATION PROCEDURES

A more comprehensive picture of the built-in aids just described
can be obtained from a description of the diagnostic procedures used. I 
will first discuss marginal checking and then will illustrate methods of 
locating sudden and intermittent faults. 

5.1 Marginal Checking 

Checking for low operating margins is a daily preventive
maintenance procedure. For the complete routine, several different programs
are used, each designed to thoroughly exercise a different portion of the
computer. The principle followed is that when one portion has passed a
test satisfactorily it may then safely be used in checking another part
of the computer. For example, a test is first made of the central
control using a minimum of storage, arithmetic element, and input-output
facilities. Next is a thorough test of the arithmetic element, followed
by tests of storage, and finally of the input-output element. The
programs for these tests are designed with as many check orders as possible
so that no more than a few orders can be executed after any error before
the computer is stopped by an alarm.
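
The principle of testing each portion only with portions already proven
can be expressed as an ordered list of tests that halts at the first
failure. The Python sketch below is an assumption-laden illustration:
run_checks is a hypothetical helper, each test is reduced to a function
returning True or False, and the failing storage test is invented for
the example.

```python
from typing import Callable, List, Optional, Tuple

SectionTest = Tuple[str, Callable[[], bool]]


def run_checks(tests: List[SectionTest]) -> Tuple[List[str], Optional[str]]:
    """Run section tests in order and stop at the first failure.

    Returns (sections that passed, first failing section or None).
    """
    trusted: List[str] = []
    for name, test in tests:
        if not test():
            return trusted, name  # later sections depend on this one
        trusted.append(name)
    return trusted, None


# Order taken from the text; the storage fault is hypothetical.
tests: List[SectionTest] = [
    ("central control", lambda: True),
    ("arithmetic element", lambda: True),
    ("storage", lambda: False),
    ("input-output", lambda: True),
]
trusted, failed = run_checks(tests)
```

Stopping at the first failure matters because a later test that relies on
a faulty earlier section could report errors that are not its own.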

Typical operating procedure for testing a section of the
computer is as follows. The marginal checking equipment is set for an
automatic mode in which it selects voltage variation lines in sequence
and applies a voltage excursion to each. The magnitude of the voltage
excursion is preset for each line and therefore may differ from one line
to the next. The preset values are those that give excursions 10 percent
less than the maxima the circuits can tolerate without failing. With
such settings an automatic marginal checking sequence will cause no
failures until the margin on a circuit has dropped by more than 10
percent. If deterioration of some component causes the margin for a
line to drop more than 10 percent, during automatic marginal checking
an alarm will occur which stops the equipment and permits manual
determination of the new failure point. The excursion is then reset
to 10 percent below this new value and the new excursion is entered on
a record sheet for this line. In this manner, the only data which need
be recorded during the routine checking are those on the few lines which
have deteriorated appreciably. Unless there has been an abnormally large
drop in margin, no corrective action is taken during the marginal checking
period. Instead, a weekly maintenance period is scheduled during which
circuits whose margins are approaching a dangerously low value are
investigated and repaired.
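
The bookkeeping of this procedure can be illustrated with a short Python
sketch. It rests on several assumptions: margins are treated as single
numbers in volts, the callable current_failure_point stands in for the
actual behavior of the circuits on a line, and the function and variable
names are hypothetical. In the real procedure the automatic sequence
stops at each alarm and the new failure point is found manually; here
that step is collapsed into one function call.

```python
from typing import Callable, Dict

BACKOFF = 0.10  # preset excursion sits 10 percent below the failure point


def preset_for(failure_point: float) -> float:
    """Excursion to preset for a line that fails at `failure_point` volts."""
    return failure_point * (1.0 - BACKOFF)


def automatic_check(
    presets: Dict[str, float],
    current_failure_point: Callable[[str], float],
) -> Dict[str, float]:
    """Apply each preset excursion in sequence and re-rate lines that fail.

    Returns the record-sheet entries: only lines whose preset was changed.
    """
    record_sheet: Dict[str, float] = {}
    for line, excursion in list(presets.items()):
        failure_point = current_failure_point(line)
        if excursion >= failure_point:       # test program fails: alarm
            new_preset = preset_for(failure_point)
            presets[line] = new_preset       # reset 10 percent below new value
            record_sheet[line] = new_preset  # entered on the record sheet
    return record_sheet


# Hypothetical example: line 1 has deteriorated from 40 V to 35 V.
presets = {"line 1": preset_for(40.0), "line 2": preset_for(50.0)}
margins_now = {"line 1": 35.0, "line 2": 48.0}
sheet = automatic_check(presets, margins_now.__getitem__)
```

Only line 1 appears on the record sheet, since the margin of line 2 has
dropped by less than 10 percent and its preset excursion causes no alarm.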

As was pointed out at the beginning of my talk, one type of
fault that must be dealt with in the Whirlwind machine is maladjustment
or other weaknesses in the system resulting from installation of new
equipment. Abnormally large changes in margins detected during a routine
checking period are one way in which such weaknesses are made apparent.
For example, one installation required that an existing control pulse
be also fed into the new equipment. In order to do this the physical
arrangement of video cables carrying this signal was changed although
their logical function was not. After this installation several low
margins were found which were the result of an unforeseen change in
pulse timing caused by the change in pulse routing.

The marginal checking facilities are also valuable in trouble
location work not related to the routine preventive maintenance,
especially in evaluating the performance of new circuits or ones that have
been repaired. In a typical case an electronic switch utilizing eight
flip-flops of a new design was installed after passing exhaustive bench
tests. In the computer system, it was found that the flip-flops showed
low margins and several failures of the switch were reported within a
week. Improved flip-flop circuits which gave wide margins were then
substituted. These have operated about six months without failure.

5.2 Sudden Failures 


For sudden or intermittent failures a somewhat different
approach is needed. In the case of a sudden failure within the system,
it is necessary to isolate and repair the circuit in order to get the
system back into operation. Fortunately the procedure for doing this is
relatively straightforward so little time is lost on the average. A
program is inserted which shows the failure. This can be the one that
was in use at the time the failure occurred or a simplified one designed
on the spot which produces the same failure. With the cyclic program
control, it is possible to quickly determine on which step the alarm
occurs. This control periodically restarts the program and then stops
it after an arbitrary number of orders have been executed. Usually
an analysis of flip-flop indicator light patterns for a few steps
preceding the alarm will show where information is failing to transfer
properly. Then simple signal tracing in the suspected circuits using
the test oscilloscope and remote video probes will pinpoint the
difficulty.
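
The analysis of indicator light patterns amounts to comparing what each
register should hold after each step with what is actually observed. A
minimal Python sketch of that comparison follows. It assumes the
patterns have been written down as dictionaries, one per step; the
function first_bad_transfer, the register labels and the sample values
are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Snapshot = Dict[str, int]


def first_bad_transfer(
    expected: List[Snapshot], observed: List[Snapshot]
) -> Optional[Tuple[int, str]]:
    """Return (step, register) of the earliest mismatch, or None if all agree."""
    for step, (want, got) in enumerate(zip(expected, observed), start=1):
        for register in sorted(want):
            if got.get(register) != want[register]:
                return step, register
    return None


# Hypothetical patterns for three steps; one bit is lost going into BR.
expected = [{"AC": 5, "BR": 0}, {"AC": 5, "BR": 5}, {"AC": 10, "BR": 5}]
observed = [{"AC": 5, "BR": 0}, {"AC": 5, "BR": 4}, {"AC": 10, "BR": 4}]
fault = first_bad_transfer(expected, observed)
```

Here the earliest disagreement is at step 2 in register BR, which would
direct signal tracing to the circuits feeding that register.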
5.3 Intermittent Failures


The most troublesome failures in Whirlwind, as, I suspect, in
any computer, are intermittent failures. Usually the amount of data
available is highly inadequate for localizing the difficulty so one
is forced to use cut and try procedures. The search for an intermittent
starts with a study of all available reports on recent transient failures:
report forms filled out by users, photographs of the indicators and
controls taken following unexplainable errors, and any observations made
by engineers and technicians while working on the system. From such
information, a technician familiar with the machine logic, in general,
can estimate what area of the computer produced the failure. He then
inserts a program and tests the suspected components or panels by lightly
tapping them to see if any errors are introduced. A momentary short
between a control grid and another element in a gate tube is an example of
an intermittent failure which can be located quite regularly. It
generally will cause an output pulse from the tube even when no input pulse
is supplied. If such a failure were suspected, the program inserted
would be one which supplied no input signals to the tube but which checked
for presence of output pulses from it.
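
The logic of such a test program is simple enough to sketch in Python.
The model is deliberately crude and entirely hypothetical: a gate tube is
reduced to a function of one input, and the momentary short that tapping
would provoke is replaced by a deterministic fault on every n-th cycle
so that the example is repeatable.

```python
from typing import Callable

Gate = Callable[[bool], bool]


def make_gate(short_every: int = 0) -> Gate:
    """Model a gate tube; `short_every` > 0 injects a short on every n-th cycle."""
    calls = 0

    def gate(input_pulse: bool) -> bool:
        nonlocal calls
        calls += 1
        shorted = short_every > 0 and calls % short_every == 0
        return input_pulse or shorted  # a short yields an output with no input

    return gate


def count_spurious_outputs(gate: Gate, cycles: int) -> int:
    """Hold the gate input off for `cycles` cycles and count output pulses."""
    return sum(1 for _ in range(cycles) if gate(False))
```

With no input ever supplied, any nonzero count points at the tube under
test: a sound gate gives zero, while the faulty model gives one output
for each injected short.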

In carrying out cut and try procedures for locating intermittents,
the cyclic program control and marginal checking facilities may also prove
useful. There was a recent instance where the computer showed symptoms of
an intermittent failure which was later tracked down by means of special
diagnostic programs and the use of the cyclic program control. After the
trouble was located, it was obvious that marginal checking would also have
pointed out the defect. In this instance, the symptoms indicated that a
register occasionally was not being cleared at the proper time. A special
program designed to emphasize this failure was inserted. It uncovered the
fact that the clearing operation was correct but the register was receiving
a spurious read-in shortly after the clear pulse. This was traced to an
improperly terminated delay line which was reflecting a delayed pulse with
sufficient amplitude to cause the occasional read-in. However, the faulty
delay-line condition had existed for some time. It was discovered that
the routine action of replacing the buffer amplifier that fed the delay
line was the direct cause of the intermittent trouble. It gave a somewhat
higher output so the unwanted reflection occasionally exceeded the
permissible limit for noise in that circuit. If marginal checking had been
performed on this amplifier, the line would have shown a very low margin
so the defect could also have been readily found by that means.


6.0 SUMMARY 


As a brief review of my remarks, I will show some slides which
illustrate the more significant points that have been covered.

(SLIDE 1)

In the first slide are listed the types of failures which have 
shown up in operation and maintenance of the Whirlwind I computer: gradual 
deterioration, sudden failures, intermittent failures, and weaknesses due 
to new equipment installation. The intermittent type are most troublesome 
since the other types can be dealt with in a routine and straightforward
manner. 

(SLIDE 2) 


The second slide is a view of a part of the computer showing 
the open type of construction used. This suggests why it is practical
and desirable to repair circuits in place rather than to replace panels.
Remote video probes can be placed on any point in a circuit for viewing 
waveforms on a central test oscilloscope. As I have pointed out this has 
had an influence on the trouble location procedures that have been developed. 

(SLIDE 3) 


The next slide shows the computer control center. The flexibility
provided by this relay-rack type of installation permitted frequent
alteration of the control facilities while trouble location techniques were
being worked out. Grouped in this area are the marginal checking controls,
flip-flop indicators, alarm lights, switches for controlling the computer's
operation and inserting or altering its program, a computer output display
scope, test oscilloscopes with pushbutton selection of many critical
waveforms or signals from remote video probes, and a master intercom station
for communication with other computer working areas.

(SLIDE 4)


In maintenance procedures, major use is made of the marginal
checking facilities built into the Whirlwind computer. They are used daily
in routine examinations of the system for deteriorating circuits. These
daily tests provide records of gradual deterioration so most component
replacement can be done during scheduled maintenance periods. This slide
shows a typical record of deterioration on one line. The dated entries
are new voltage excursions set in after the program failed with the
previous excursion. In December 1952 the negative margin dropped to the
danger point of 12 volts. Two tubes were replaced and the original margin
was restored. The marginal checking equipment is also invaluable in
evaluating the performance of newly installed equipment as well as in isolating
intermittent failures that inadvertently may result when installation or
repair work is done.

(SLIDE 5) 


Sudden failures are analyzed by utilizing the cyclic program
control and observing results on indicator lights and on the test
oscilloscope. Intermittent failures require careful study of all available
symptoms and a shrewd estimate by an experienced operator of where to
look for the trouble. This slide is a typical photograph of the
operating console taken after an alarm. It shows indicator light patterns
and switch settings which can be analyzed when tracking down failures.

In both of these cases the computer program used is highly significant
but little success has been achieved in developing one that is universally
useful. Instead it has been found that relatively simple order sequences
uniquely designed for the problem at hand and modified as test results
require are a more powerful tool.

(SLIDE 6)


An adequate measure of the effectiveness of trouble location
procedures in Whirlwind is difficult where new installation is continually
being carried out. On this slide, however, are listed some data that I
feel have significance. Of the time scheduled for useful computation
during the past year about 90 percent was usable. This figure is based
on reports submitted by groups using the computer rather than on statements
of personnel maintaining it. During that period there has been an average
of about 100 man hours per week of installation work, done on a weekly basis.
Twenty-four hours per week of preventive maintenance is listed. About half
of this is routine daily checking while the remainder is test periods
following installation work. The average length of the periods when the
computer has been forced out of operation during scheduled computation work
is of the order of 20 minutes.

Although this record may be a tolerable one at present, continued
effort is being expended to better it. Most needed is a more powerful attack
on the problem of intermittent failures.

INDICATOR LIGHT PATTERN FOLLOWING AN ALARM


MAINTENANCE EFFECTIVENESS

SCHEDULED TIME USABLE              90 PERCENT
INSTALLATION TIME                  100 MAN HRS/WEEK
PREVENTIVE MAINTENANCE TIME        24 HRS/WEEK
AVERAGE UNSCHEDULED DOWN TIME      20 MINUTES